{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "python3",
   "display_name": "DataAnalysis"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import brown\n",
    "from tools import show_subtitle\n",
    "%matplotlib inline"
   ]
  },
  {
   "source": [
    "# Ch5 分类和标注词汇\n",
    "\n",
    "1.  什么是词汇分类，在自然语言处理中它们如何使用？\n",
    "2.  对于存储词汇和它们的分类来说什么是好的 Python 数据结构？\n",
    "3.  如何自动标注文本中每个词汇的词类？\n",
    "\n",
    "-   词性标注（parts-of-speech tagging，POS tagging）：简称标注。将词汇按照它们的词性（parts-of-speech，POS）进行分类并对它们进行标注\n",
    "-   词性：也称为词类或者词汇范畴。\n",
    "-   标记集：用于特定任务标记的集合。"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "## 5.3 使用Python字典映射词及其属性(P206)\n",
    "Python字典数据类型（以称为关联数组或者哈希数组），学习如何使用字典表示包括词性在内的各种不同语言信息"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 5.3.1 索引链表 与 字典 的区别\n",
    "\n",
    "图5-2：链表查找：在整数索引的基础上，访问 Python 链表的内容\n",
    "\n",
    "图5-3：字典查询：使用一个关键字，访问一个字典的条目\n",
    "\n",
    "表5-4：语言学对象从键到值的映射"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "source": [
    "### 5.3.2. Python字典"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos['colorless']=  ADJ\npos=  {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\n"
     ]
    }
   ],
   "source": [
    "pos = {}\n",
    "pos['colorless'] = 'ADJ'\n",
    "pos['ideas'] = 'N'\n",
    "pos['sleep'] = 'V'\n",
    "pos['furiously'] = 'ADV'\n",
    "print(\"pos['colorless']= \", pos['colorless'])\n",
    "print(\"pos= \", pos)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 访问不存在的键，报 KeyError\n",
    "# pos['green']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "list(pos)=  ['colorless', 'ideas', 'sleep', 'furiously']\n"
     ]
    }
   ],
   "source": [
    "# 字典转换成链表\n",
    "print(\"list(pos)= \", list(pos))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "sorted(pos)=  ['colorless', 'furiously', 'ideas', 'sleep']\n"
     ]
    }
   ],
   "source": [
    "# 字典排序\n",
    "print(\"sorted(pos)= \", sorted(pos))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "word_list=  ['colorless', 'ideas']\n"
     ]
    }
   ],
   "source": [
    "# 字典顺序访问\n",
    "word_list = [\n",
    "        w\n",
    "        for w in pos\n",
    "        if w.endswith('s')\n",
    "]\n",
    "print(\"word_list= \", word_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "colorless: ADJ\nideas: N\nsleep: V\nfuriously: ADV\n"
     ]
    }
   ],
   "source": [
    "# 遍历字典中的数据\n",
    "# for word in sorted(pos):\n",
    "for word in pos:\n",
    "    print(word + \":\", pos[word])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "键=  dict_keys(['colorless', 'ideas', 'sleep', 'furiously'])\n值=  dict_values(['ADJ', 'N', 'V', 'ADV'])\n对=  dict_items([('colorless', 'ADJ'), ('ideas', 'N'), ('sleep', 'V'), ('furiously', 'ADV')])\n"
     ]
    }
   ],
   "source": [
    "# 访问字典的方法\n",
    "print(\"键= \", pos.keys())\n",
    "print(\"值= \", pos.values())\n",
    "print(\"对= \", pos.items())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "colorless: ADJ\nideas: N\nsleep: V\nfuriously: ADV\n"
     ]
    }
   ],
   "source": [
    "# 分开获取字典中条目的键和值\n",
    "# for key, val in sorted(pos.items()):\n",
    "for key, val in pos.items():\n",
    "    print(key + \":\", val)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos['sleep']=  V\npos['sleep']=  N\n"
     ]
    }
   ],
   "source": [
    "# 字典中键必须惟一\n",
    "pos['sleep'] = 'V'\n",
    "print(\"pos['sleep']= \", pos['sleep'])\n",
    "pos['sleep'] = 'N'\n",
    "print(\"pos['sleep']= \", pos['sleep'])"
   ]
  },
  {
   "source": [
    "### 5.3.3. 定义字典（创建字典的两种方式）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos=  {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\npos=  {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\n"
     ]
    }
   ],
   "source": [
    "pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\n",
    "print(\"pos= \", pos)\n",
    "pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')\n",
    "print(\"pos= \", pos)"
   ]
  },
  {
   "source": [
    "### 5.3.4. 默认字典（字典创建新键时的默认值）"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "frequency=  defaultdict(<class 'int'>, {'colorless': 4})\n",
      "frequency['colorless']=  4\n",
      "frequency['ideas']=  0\n",
      "list(frequency.items())=  [('colorless', 4), ('ideas', 0)]\n"
     ]
    }
   ],
   "source": [
    "from collections import defaultdict\n",
    "\n",
    "# 默认值可以是不变对象\n",
    "frequency = defaultdict(int)  \n",
    "frequency['colorless'] = 4\n",
    "print(\"frequency= \", frequency)\n",
    "print(\"frequency['colorless']= \", frequency['colorless'])\n",
    "# 访问不存在的键时，自动创建，使用定义的默认值\n",
    "print(\"frequency['ideas']= \", frequency['ideas'])  \n",
    "print(\"list(frequency.items())= \", list(frequency.items()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos = defaultdict(<class 'list'>, {'sleep': ['NOUN', 'VERB']})\npos['sleep']=  ['NOUN', 'VERB']\npos['ideas']=  []\nlist(pos.items())=  [('sleep', ['NOUN', 'VERB']), ('ideas', [])]\n"
     ]
    }
   ],
   "source": [
    "# 默认值也可以是可变对象\n",
    "pos = defaultdict(list)  \n",
    "pos['sleep'] = ['NOUN', 'VERB']\n",
    "print(\"pos =\", pos)\n",
    "print(\"pos['sleep']= \", pos['sleep'])\n",
    "print(\"pos['ideas']= \", pos['ideas'])\n",
    "print(\"list(pos.items())= \", list(pos.items()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "oneObject._data=  5\n",
      "twoObject._data=  0\n",
      "pos['ideas']=  <__main__.myObject object at 0x000000000C192240>\n",
      "list(pos.items())=  [('sleep', <__main__.myObject object at 0x000000000C1922E8>), ('ideas', <__main__.myObject object at 0x000000000C192240>)]\n",
      "pos['sleep']._data=  5\n",
      "pos['ideas']._data=  0\n"
     ]
    }
   ],
   "source": [
    "# 默认值为自定义对象\n",
    "class myObject():\n",
    "    def __init__(self, data=0):\n",
    "        self._data = data\n",
    "        return\n",
    "\n",
    "\n",
    "oneObject = myObject(5)\n",
    "print(\"oneObject._data= \", oneObject._data)\n",
    "twoObject = myObject()\n",
    "print(\"twoObject._data= \", twoObject._data)\n",
    "\n",
    "pos = defaultdict(myObject)\n",
    "pos['sleep'] = myObject(5)\n",
    "print(\"pos['ideas']= \", pos['ideas'])\n",
    "print(\"list(pos.items())= \", list(pos.items()))\n",
    "print(\"pos['sleep']._data= \", pos['sleep']._data)\n",
    "print(\"pos['ideas']._data= \", pos['ideas']._data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos['colorless']=  ADJ\n",
      "pos['blog']=  NOUN\n",
      "list(pos.items())=  [('colorless', 'ADJ'), ('blog', 'NOUN')]\n"
     ]
    }
   ],
   "source": [
    "# 默认 lambda 表达式\n",
    "pos = defaultdict(lambda: 'NOUN')\n",
    "pos['colorless'] = 'ADJ'\n",
    "print(\"pos['colorless']= \", pos['colorless'])\n",
    "print(\"pos['blog']= \", pos['blog'])\n",
    "print(\"list(pos.items())= \", list(pos.items()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "list(mapping.items())[:20]=  [(',', 1993), (\"'\", 1731), ('the', 1527), ('and', 802), ('.', 764), ('to', 725), ('a', 615), ('I', 543), ('it', 527), ('she', 509), ('of', 500), ('said', 456), (\",'\", 397), ('Alice', 396), ('in', 357), ('was', 352), ('you', 345), (\"!'\", 278), ('that', 275), ('as', 246)]\nalice2[:20]=  [3, 396, 1731, 195, 3, 357, 3, 55, 'UNK', 'UNK', 'UNK', 'UNK', 12, 543, 764, 3, 1527, 45, 141, 'UNK']\n"
     ]
    }
   ],
   "source": [
    "# 使用 UNK(out of vocabulary)（超出词汇表）标识符来替换低频词汇\n",
    "alice = nltk.corpus.gutenberg.words('carroll-alice.txt')\n",
    "vocab = nltk.FreqDist(alice)\n",
    "v1000 = [\n",
    "        word\n",
    "        for (word, _) in vocab.most_common(1000)\n",
    "]\n",
    "mapping = defaultdict(lambda: 'UNK')\n",
    "for v in v1000:\n",
    "    mapping[v] = vocab[v]\n",
    "print(\"list(mapping.items())[:20]= \", list(mapping.items())[:20])\n",
    "alice2 = [\n",
    "        mapping[v]\n",
    "        for v in alice\n",
    "]\n",
    "print(\"alice2[:20]= \", alice2[:20])"
   ]
  },
  {
   "source": [
    "### 5.3.5. 递增地更新字典"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "counts['NOUN']=  30654\nsorted(counts)=  ['.', 'ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRON', 'PRT', 'VERB', 'X']\ncounts=  defaultdict(<class 'int'>, {'DET': 11389, 'NOUN': 30654, 'ADJ': 6706, 'VERB': 14399, 'ADP': 12355, '.': 11928, 'ADV': 3349, 'CONJ': 2717, 'PRT': 2264, 'PRON': 2535, 'NUM': 2166, 'X': 92})\n"
     ]
    }
   ],
   "source": [
    "# Ex5-3 递增地更新字典，按值排序\n",
    "counts = nltk.defaultdict(int)\n",
    "for (word, tag) in nltk.corpus.brown.tagged_words(categories='news', tagset='universal'):\n",
    "    counts[tag] += 1\n",
    "print(\"counts['NOUN']= \", counts['NOUN'])\n",
    "print(\"sorted(counts)= \", sorted(counts))\n",
    "print(\"counts= \", counts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "sort_keys=  [('.', 11928), ('ADJ', 6706), ('ADP', 12355), ('ADV', 3349), ('CONJ', 2717), ('DET', 11389), ('NOUN', 30654), ('NUM', 2166), ('PRON', 2535), ('PRT', 2264), ('VERB', 14399), ('X', 92)]\nsort_keys=  [('X', 92), ('NUM', 2166), ('PRT', 2264), ('PRON', 2535), ('CONJ', 2717), ('ADV', 3349), ('ADJ', 6706), ('DET', 11389), ('.', 11928), ('ADP', 12355), ('VERB', 14399), ('NOUN', 30654)]\nsort_keys=  [('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]\nkey_list=  ['NOUN', 'VERB', 'ADP', '.', 'DET', 'ADJ', 'ADV', 'CONJ', 'PRON', 'PRT', 'NUM', 'X']\n"
     ]
    }
   ],
   "source": [
    "from operator import itemgetter\n",
    "\n",
    "# IndexError: tuple index out of range\n",
    "sort_keys = sorted(counts.items(), key=itemgetter(0), reverse=False)\n",
    "print(\"sort_keys= \", sort_keys)\n",
    "sort_keys = sorted(counts.items(), key=itemgetter(1), reverse=False)\n",
    "print(\"sort_keys= \", sort_keys)\n",
    "sort_keys = sorted(counts.items(), key=itemgetter(1), reverse=True)\n",
    "print(\"sort_keys= \", sort_keys)\n",
    "# itemgetter(2) 没有这个选项，没法用于排序\n",
    "# sort_keys = sorted(counts.items(), key=itemgetter(2), reverse=False)\n",
    "# print(\"sort_keys= \", sort_keys)\n",
    "key_list = [\n",
    "        t\n",
    "        for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)\n",
    "]\n",
    "print(\"key_list= \", key_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pair=  ('NP', 8336)\npair[1]=  8336\nitemgetter(0)(pair)=  NP\nitemgetter(1)(pair)=  8336\n"
     ]
    }
   ],
   "source": [
    "pair = ('NP', 8336)\n",
    "print(\"pair= \", pair)\n",
    "print(\"pair[1]= \", pair[1])\n",
    "print(\"itemgetter(0)(pair)= \", itemgetter(0)(pair))\n",
    "print(\"itemgetter(1)(pair)= \", itemgetter(1)(pair))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "last_letters['ly']=  ['abactinally', 'abandonedly', 'abasedly', 'abashedly', 'abashlessly', 'abbreviately', 'abdominally', 'abhorrently', 'abidingly', 'abiogenetically', 'abiologically', 'abjectly', 'ableptically', 'ably', 'abnormally', 'abominably', 'aborally', 'aboriginally', 'abortively', 'aboundingly']\nlast_letters['xy']=  ['acyloxy', 'adnexopexy', 'adoxy', 'agalaxy', 'alkoxy', 'alkyloxy', 'amidoxy', 'anorexy', 'anthotaxy', 'apoplexy', 'apyrexy', 'asphyxy', 'ataraxy', 'ataxy', 'azoxy', 'bandboxy', 'barotaxy', 'benzoxy', 'biotaxy', 'boxy']\n"
     ]
    }
   ],
   "source": [
    "# 通过最后两个字母索引词汇\n",
    "last_letters = defaultdict(list)\n",
    "words = nltk.corpus.words.words('en')\n",
    "for word in words:\n",
    "    key = word[-2:]\n",
    "    last_letters[key].append(word)\n",
    "\n",
    "print(\"last_letters['ly']= \", last_letters['ly'][:20])\n",
    "print(\"last_letters['xy']= \", last_letters['xy'][:20])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "anagrams['aeilnrt']=  ['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']\nanagrams['kloo']=  ['kolo', 'look']\nanagrams['Zahity']=  ['Zythia']\nanagrams[''.join(sorted('love'))]=  ['levo', 'love', 'velo', 'vole']\n"
     ]
    }
   ],
   "source": [
    "# 颠倒字母而成的字（回文构词法，相同字母异序词，易位构词，变位词）索引词汇\n",
    "anagrams = defaultdict(list)\n",
    "for word in words:\n",
    "    key = ''.join(sorted(word))\n",
    "    anagrams[key].append(word)\n",
    "print(\"anagrams['aeilnrt']= \", anagrams['aeilnrt'])\n",
    "print(\"anagrams['kloo']= \", anagrams['kloo'])\n",
    "print(\"anagrams['Zahity']= \", anagrams['Zahity'])\n",
    "print(\"anagrams[''.join(sorted('love'))]= \", anagrams[''.join(sorted('love'))])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "anagrams['aeilnrt']=  ['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']\n",
      "anagrams.most_common(20)=  [('agnor', 9), ('acert', 9), ('eerst', 9), ('aelrst', 8), ('aelpt', 8), ('adelnr', 7), ('aelm', 7), ('aelrt', 7), ('aeglr', 7), ('ailr', 7), ('airst', 7), ('aemrt', 7), ('aenprt', 7), ('aeerst', 7), ('aelt', 7), ('aderrt', 7), ('adert', 7), ('aeginst', 7), ('aelps', 7), ('aelst', 7)]\n"
     ]
    }
   ],
   "source": [
    "# NLTK 提供的创建 defaultdict(list) 更加简便的方法\n",
    "# nltk.Index() 是对 defaultdict(list) 的支持\n",
    "# nltk.FreqDist() 是对 defaultdict(int) 的支持（附带了排序和绘图的功能）\n",
    "anagrams = nltk.Index((''.join(sorted(w)), w) for w in words)\n",
    "print(\"anagrams['aeilnrt']= \", anagrams['aeilnrt'])\n",
    "\n",
    "anagrams = nltk.FreqDist(''.join(sorted(w)) for w in words)\n",
    "print(\"anagrams.most_common(20)= \", anagrams.most_common(20))"
   ]
  },
  {
   "source": [
    "### 5.3.6. 复杂的键和值"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos[('DET', 'right')]=  defaultdict(<class 'int'>, {'NOUN': 5, 'ADJ': 11})\npos[('NOUN', 'further')]=  defaultdict(<class 'int'>, {'ADV': 2})\npos[('PRT', 'book')]=  defaultdict(<class 'int'>, {})\n--------------- >pos< ---------------\nDET Fulton\nNOUN County\nNOUN Grand\nADJ Jury\nNOUN said\nVERB Friday\nNOUN an\nDET investigation\nNOUN of\nADP Atlanta's\nNOUN recent\nADJ primary\nNOUN election\nNOUN produced\nVERB ``\n. no\nDET evidence\nNOUN ''\n. that\nADP any\n"
     ]
    }
   ],
   "source": [
    "# 使用复杂的键和值的默认字典\n",
    "pos = defaultdict(lambda: defaultdict(int))\n",
    "brown_news_tagged = nltk.corpus.brown.tagged_words(categories='news', tagset='universal')\n",
    "for ((w1, t1), (w2, t2)) in nltk.bigrams(brown_news_tagged):\n",
    "    pos[(t1, w2)][t2] += 1\n",
    "\n",
    "print(\"pos[('DET', 'right')]= \", pos[('DET', 'right')])\n",
    "print(\"pos[('NOUN', 'further')]= \", pos[('NOUN', 'further')])\n",
    "print(\"pos[('PRT', 'book')]= \", pos[('PRT', 'book')])\n",
    "show_subtitle(\"pos\")\n",
    "for i, (key, value) in enumerate(pos):\n",
    "    if i<20:\n",
    "        print(key,value)"
   ]
  },
  {
   "source": [
    "### 5.3.7. 颠倒字典\n",
    "\n",
    "表5-5：Python 字典的常用方法"
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "key_list=  ['mortal', 'Against', 'Him', 'There', 'brought', 'King', 'virtue', 'every', 'been', 'thine']\n"
     ]
    }
   ],
   "source": [
    "# 通过键查值速度很快，但是通过值查键的速度较慢，为也加速查找可以重新创建一个映射值到键的字典\n",
    "counts = defaultdict(int)\n",
    "for word in nltk.corpus.gutenberg.words('milton-paradise.txt'):\n",
    "    counts[word] += 1\n",
    "\n",
    "# 通过值查键的一种方法\n",
    "key_list = [\n",
    "        key\n",
    "        for (key, value) in counts.items()\n",
    "        if value == 32\n",
    "]\n",
    "print(\"key_list= \", key_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos=  {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\npos2['N']=  ideas\n"
     ]
    }
   ],
   "source": [
    "# 使用键-值对字典创建值-键对字典\n",
    "# pos 是键-值对字典；pos2 是值-键对字典\n",
    "pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}\n",
    "print(\"pos= \", pos)\n",
    "pos2 = dict(\n",
    "        (value, key)\n",
    "        for (key, value) in pos.items()\n",
    ")\n",
    "print(\"pos2['N']= \", pos2['N'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos=  {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV', 'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'}\npos2['ADV']=  ['furiously', 'peacefully']\n"
     ]
    }
   ],
   "source": [
    "# 一个键有多个值的键-值字典不能使用上面的方法创建值-键字典\n",
    "# 提供了一个新的方法创建值-键对字典\n",
    "pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})\n",
    "print(\"pos= \", pos)\n",
    "pos2 = defaultdict(list)\n",
    "for key, value in pos.items():\n",
    "    pos2[value].append(key)\n",
    "print(\"pos2['ADV']= \", pos2['ADV'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": [
      "pos2['ADV']=  ['furiously', 'peacefully']\n"
     ]
    }
   ],
   "source": [
    "# 使用 nltk.Index() 函数创建新的值-键对字典\n",
    "pos2 = nltk.Index(\n",
    "        (value, key)\n",
    "        for (key, value) in pos.items()\n",
    ")\n",
    "print(\"pos2['ADV']= \", pos2['ADV'])"
   ]
  }
 ]
}